Theorem 1. If $f_1 > F_0/\lambda$ then Algorithm 1 finds $\hat f_1$ such that $(1-\varepsilon) f_1 < \hat f_1 < (1+\varepsilon) f_1$ with
probability at least $(1-\delta)$. The algorithm uses $O\left(\lambda \log(1/\delta) \log(N) / \varepsilon^2\right)$ memory and $O(1)$ time per
update, where $N$ is the number of elements in the stream.

The proof relies on the hash function being perfectly random, i.e. all hash values are independent 
random variables. In practice no such hash functions exist, but pairwise independent hash functions 
are close to being perfectly random when the underlying data has entropy [1]. If a hash function is 
$\delta'$-close to being perfectly random, the difference in the probability of any event will be at most $\delta'$.
This allows us to analyze the probability as if the hash function were perfectly random, and simply
add $\delta'$ to the probability of failure at the end. In practice this difference, $\delta'$, is much smaller than
the probability of failure.

Our data structure is a list $T$ of $\log(n)$ arrays of length $R$, where $R = 720\lambda \log(8/\delta)/\varepsilon^2$; we have
not made any attempts to optimize the constant, so as not to further complicate the proof. Each
value in an array $T_w$ is a 2-bit counter which can store the numbers from 0 to 3.

For each k-mer a we compute a hash value h(a), where h is a hash function that gives 64-bit 
values and is guaranteed to give the same value for a and the reverse complement of a. For the 
value h(a) we let $w$ be the highest integer such that $2^{w-1}$ divides h(a). Equivalently, $w$ is the
least-significant position that is 1 when h(a) is written in binary. We let $j = \lfloor h(a)/2^{w} \rfloor \bmod R$ and increment
the value in $T_w[j]$ by one (if the value is already 3 we do nothing).
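
The update step can be sketched in Python as follows. This is only an illustration under stated assumptions: `hash64` uses BLAKE2b on the lexicographically smaller of a k-mer and its reverse complement as a stand-in for the unspecified hash function h, the slot index is taken to be the remaining hash bits reduced modulo the array length, `R` and `NUM_LEVELS` are arbitrary illustrative values, and one byte per counter is used instead of 2 bits.

```python
import hashlib

R = 1024          # hypothetical array length, chosen only for illustration
NUM_LEVELS = 32   # hypothetical number of arrays in T

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hash64(kmer: str) -> int:
    """Stand-in 64-bit hash that agrees on a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    canonical = min(kmer, reverse_complement)
    digest = hashlib.blake2b(canonical.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def level_and_slot(h: int, r: int = R, levels: int = NUM_LEVELS):
    """Return (w, j): w is the 1-based position of the lowest set bit of h."""
    if h == 0:
        return levels, 0                      # assumption: park the all-zero hash in the last array
    w = min((h & -h).bit_length(), levels)    # lowest set bit, capped at the number of arrays
    j = (h >> w) % r                          # assumption: remaining bits, reduced modulo r
    return w, j


def new_table(r: int = R, levels: int = NUM_LEVELS):
    """One bytearray per level; each byte stands in for a 2-bit counter."""
    return [bytearray(r) for _ in range(levels)]


def update(table, kmer: str) -> None:
    """Process one k-mer occurrence: saturating increment of T_w[j]."""
    w, j = level_and_slot(hash64(kmer), len(table[0]), len(table))
    row = table[w - 1]
    if row[j] < 3:
        row[j] += 1
```

Because the hash is computed on the canonical form, `update(table, "ACGTA")` and `update(table, "TACGT")` increment the same counter, and repeated occurrences of one k-mer never push a counter past 3.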

We can see that the value of $w$ follows a geometric distribution $\mathrm{Geo}(1/2)$. Thus, half the k-mers
will hash to the first array, a quarter to the second, etc. To estimate $f_1$ we select the array $w^*$
that is closest to being half-full, $w^* = \arg\min_w \bigl|\, |\{i : T_w[i] = 0\}| - R/2 \,\bigr|$. Suppose $N_{w^*}$ distinct k-mers
hash to this array of $R$ values. The probability that an element in the array has value 0 is then
$p_0 = \left(1 - \frac{1}{R}\right)^{N_{w^*}}$. Suppose that $x_1$ of these $N_{w^*}$ k-mers are singletons. The only way an element in
the array has value 1 is if exactly one of the singletons hashed to this location and no other k-mer did. This occurs with
probability

$$p_1 = \frac{x_1}{R}\left(1 - \frac{1}{R}\right)^{N_{w^*}-1}.$$

Let $\hat p_0$ and $\hat p_1$ be the fraction of elements with values 0 and 1 respectively. Then $\hat p_0$ and $\hat p_1$ are good
estimates for the probability of a cell having value 0 or 1. We can estimate $x_1$ as

$$\hat x_1 = (R-1)\,\frac{\hat p_1}{\hat p_0}$$
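
The form of this estimator follows from dividing the two cell probabilities. As a short worked derivation, write $p_0$ and $p_1$ for the probabilities that a cell holds 0 and 1 (labels used here for convenience) and assume every k-mer picks its cell uniformly at random:

```latex
\begin{align*}
p_0 &= \left(1 - \tfrac{1}{R}\right)^{N_{w^*}}, \qquad
p_1 = \frac{x_1}{R}\left(1 - \tfrac{1}{R}\right)^{N_{w^*}-1}, \\
\frac{p_1}{p_0} &= \frac{x_1}{R}\cdot\frac{1}{1 - \tfrac{1}{R}} = \frac{x_1}{R-1}
\quad\Longrightarrow\quad
x_1 = (R-1)\,\frac{p_1}{p_0}.
\end{align*}
```

Replacing the two probabilities by the observed fractions of cells gives the estimate of $x_1$.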
Since only a fraction $2^{-w^*}$ of all k-mers hashed to this particular array we can estimate $f_1$ as

$$\hat f_1 = 2^{w^*} \hat x_1 = 2^{w^*} (R-1)\,\frac{\hat p_1}{\hat p_0}.$$
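
The estimation step can be sketched as a small Python function. It is a minimal illustration, assuming the table is given as a list of equal-length counter arrays (Python lists or bytearrays) with level 1 first; the function name and the tie-breaking rule (lowest level wins) are choices made here, not taken from the text.

```python
def estimate_singletons(table):
    """Return (w_star, f1_hat) from a list of equal-length counter arrays."""
    r = len(table[0])
    # Level whose number of empty cells is closest to r / 2 ("half-full").
    w_star = min(range(1, len(table) + 1),
                 key=lambda w: abs(table[w - 1].count(0) - r / 2))
    row = table[w_star - 1]
    p0_hat = row.count(0) / r          # observed fraction of cells equal to 0
    p1_hat = row.count(1) / r          # observed fraction of cells equal to 1
    if p0_hat == 0:
        raise ValueError("selected array has no empty cells")
    x1_hat = (r - 1) * p1_hat / p0_hat  # estimated singletons in this array
    return w_star, 2 ** w_star * x1_hat
```

For a toy table `[[0, 0, 0, 0, 1, 1, 2, 3], [0] * 8]` the first array is exactly half empty, so it is selected, and the estimate is computed from its fractions of zeros and ones.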



Given our choice of $R$, we need to show that the bounds on $N_{w^*}$, $\hat p_0$, $x_1$ and $\hat p_1$ established below
all hold simultaneously with probability at least $(1-\delta)$.

Let $w'$ be such that ^ < ^7 — it > thus the expected number of k-mers that hash to level $w'$ is
^5" as close to as possible. The observed number of k-mers mapping to this level is a binomial
random variable and can be bounded by a simple Chernoff bound
Thus $w'$ is the level chosen by our algorithm and j < $N_{w'}$ < ^p. This implies a lower bound on
the probability $p_0$
Note that $\hat p_0$, $\hat p_1$ and $x_1$ are all based on counting the number of k-mers, and are deterministic
conditional on the hash values of all k-mers. Changing the hash value of one k-mer can affect $x_1$
by at most 1, and $\hat p_0$, $\hat p_1$ by at most $\frac{1}{R}$. Since $\hat p_0$ and $\hat p_1$ are functions of $N_{w^*}$ hash values and
< ^ , this implies that $p_0$ > ^. Thus using the Azuma-Hoeffding inequality we get
Given that there are $f_1$ singleton k-mers and each has a probability $2^{-w^*}$ of hashing to level $w^*$
we get, using a regular Chernoff bound
Here we have used the inequality $f_1 > F_0/\lambda$.

To obtain the corresponding bound on $|\hat p_1 - p_1|$ we need a lower bound on $p_1$. First note
this in turn implies that $p_1 = \frac{x_1}{R-1}\,p_0$ > , assuming $\varepsilon$ < ^ .
Using Azuma-Hoeffding again we get
By the union bound the probability that any of the 4 bounds fail is at most $\delta$, which implies
that our estimates are accurate to within a factor $1 \pm \varepsilon/3$ with probability at least $1-\delta$.
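Having shown that our estimates are accurate we get,

$$
\begin{aligned}
\hat f_1 &= 2^{w^*} \hat x_1 = 2^{w^*} (R-1)\,\frac{\hat p_1}{\hat p_0} \\
&= \left(1 \pm \tfrac{\varepsilon}{3}\right)^2 2^{w^*} x_1 \\
&= \left(1 \pm \tfrac{\varepsilon}{3}\right)^3 f_1 \\
&= (1 \pm \varepsilon) f_1
\end{aligned}
$$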

This shows that the estimator yields an $\varepsilon$-approximation with probability at least $(1-\delta)$.
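
As an informal sanity check of the estimator, and not a substitute for the proof, the whole procedure can be simulated on a synthetic stream. The sketch below assumes an idealised hash (independent random 64-bit values from Python's `random` module), an arbitrary array length `r = 4096` rather than the value of $R$ required by the theorem, one byte per counter, and a slot index taken from the remaining hash bits modulo `r`; all names are hypothetical.

```python
import random


def simulate_estimate(f1, f2, r=4096, levels=40, seed=0):
    """Estimate f1 for a synthetic stream: f1 items seen once, f2 items seen twice."""
    rng = random.Random(seed)
    table = [bytearray(r) for _ in range(levels)]
    for occurrences, n_items in ((1, f1), (2, f2)):
        for _ in range(n_items):
            h = rng.getrandbits(64)                  # idealised hash of one distinct item
            w = min((h & -h).bit_length(), levels) if h else levels
            j = (h >> w) % r
            row = table[w - 1]
            row[j] = min(3, row[j] + occurrences)    # saturating counter, capped at 3
    # Select the level whose number of empty cells is closest to r / 2.
    w_star = min(range(1, levels + 1),
                 key=lambda w: abs(table[w - 1].count(0) - r / 2))
    row = table[w_star - 1]
    p0_hat = row.count(0) / r
    p1_hat = row.count(1) / r
    return 2 ** w_star * (r - 1) * p1_hat / p0_hat


estimate = simulate_estimate(f1=50_000, f2=50_000)
relative_error = abs(estimate - 50_000) / 50_000
```

With these illustrative parameters the relative error should be small, although the array length used here is far below what the theorem requires for a formal guarantee.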